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Abstract 

In biomedical studies, researchers are often interested in assessing the association between one or 
more ordinal explanatory variables and an outcome variable, at the same time adjusting for covari- 
ates of any type. The outcome variable may be continuous, binary, or represent censored survival 
times. In the absence of precise knowledge of the response function, using monotonicity constraints 
on the ordinal variables improves efficiency in estimating parameters, especially when sample sizes 
are small. An active set algorithm that can efficiently compute such estimators is proposed, and a 
characterization of the solution is provided. Having an efficient algorithm at hand is especially rele- 
vant when applying likelihood ratio tests in restricted generalized linear models, where one needs the 
value of the likelihood at the restricted maximizer. The algorithm is illustrated on a real life data set 
from oncology. 

Keywords: ordered explanatory variable, constrained estimation, least squares, logistic regression, Cox 
regression, active set algorithm, likelihood ratio test under linear constraints 



1 Introduction 

In many applied problems and especially in biomedical studies, researchers are interested in associating 
an outcome variable to several explanatory variables, typically via a generalized linear or proportional 
hazards regression model. Here, the explanatory variables or predictors may be continuous, nominal or 
ordered. Estimates of regression parameters can be obtained via maximizing a least-squares or (partial) 
likelihood function. Especially if the number of observations is small to moderate, researchers often 
encounter noisy estimates of the regression parameters, possibly leading to patterns in the regression 
estimates that violate the a-priori knowledge of a factor being ordered. In order to improve accuracy 
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of estimates and efficiency of overall tests for associations, it is tempting to use the prior knowledge of 
orderings in some of the regression coefficients. 

From a Bayesian perspective, receiving estimators in these type of problems is straightforward usin g 



Markov Chain Monte Carlo approaches. Pi oneered in a linear model f r amework by 



(1992), 



Gelfand et al. 

Bayes ian approaches have been proposed by lRobert and Hwangl(| 19961) : IDunson and NeelorJ ( 2003h : lDunson and Herring 
(|2003l) . We also refer to the discussion i n the latter two papers. T o use Gibbs sampling to g et the ordered 
predic tor estimate in logistic regression, iHolmes and Held (|20Qq) combine the approach in 



Gelfand et al 



(I1992T) with an auxiliary variable technique. Note that using e.g. flat priors on the regression coefficient 
vector /3 it is straightforward to show that the maximum a posteriori estimator is equal to the constrained 
MLE introduced in Section [2] 

Although conceptually straightforward, the implementation of these Bayesian approaches is not without 
fallacies. To not only get point estimates but also assess whether parameters are equal or strictly ordered 
across level of predictors, on e needs to borrow from mo re frequentist approaches and "isotonize" uncon- 
strained parameter estimates (IDunson and Neelonl.l2003l) . Only then one can accommodate "flat regions", 
i.e. successive estimates for ordered levels that are equal. 



Altho ugh there exists vast literature on frequentist estimation subject to order restrictions (IRobertson et al 



1988) , estimation in the specific regression model discussed here has gained surprisingly little attention 



(IMukerjee and Tul 1 19950 . This may be due to the fa ct that setting up algorithms in these type of prob- 
lems is generally difficult (IDunson and Neelonl . 120031) . and requires approaches that need to be adapted to 



mention Dvkstra and Robertson 


( 


1982 


); 


Matthews and Crowthei ( 


1998): Jamshidian (2004): 


Tan et al. 


(2007): Tavloretal. ( 


2007). or 


Balabdaoui et al. 


(2009) discussing computation of order restricted es- 


timates in specific regression problems, and 


Terlakv and Vial ( 1998 


): Balabdaoui and Wellner 


(2004) or 



Rufibachl (120071) for estimation of probability densities under order restrictions. Additionally, generaliza- 



tions of the pool-adiacent-vio l aters algorithm (PAVA) to i nclusion of continuous isoto nic covariates are 



discussed in 



Bacchettildl989l) : 



Morton- Jones et al. 



(120001) : lGhoshl (120071) : iChengl (120091) in the context of 
"additive isotonic regression". Estimation in this type of model is usually performed using the cyclical 
PAVA in connection with backfitting. However, note that we are not in this genuinely semiparametric 
setting, but rather the number of levels of an ordered factor is given a priori and remains fixed for any 
number of observations. 

Recen tly, a type of algorithm, which has been around in optimization theory for some de cades (Fletcher 



19871) . has gained considerable attention in the statistical literature: active set algorithms, biimbgen et al 



(|2007r) use and generalize such an algorithm to compute a log-concave density not only from i.i.d. but 
ev en from censored data. A n algorithm similar in spirit is the support reduction algorithm discussed 



in 



Groeneboom et al. 



(2008)- The latter authors apply it to the estimation of a convex density and to 
Gaussian deconvolution. A slight g eneralization of the support re d uction algorithm is used to e stimate 
a convex-shaped hazard function in Ijankowski and Wellnen (120091) . iBeran and Diimbgenl (120091) extend 
active set algorithms to the estimation of smooth bimonotone functions. They illustrate their algorithm 
on regression with two ordered covariates, so also treating the example dealt with in this paper. However, 
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Beran and Dumbgenl (120090 only consider least squares or least absolute deviation estimation, and at most 
two ordered factors. In this paper, we propose an algorithm for an arbitrary number of ordered factors, 
and we also provide a characterization of the solution. 

A key feature of an active set algorithm is, that although iterative, it terminates after finitely many steps, 
and that the solution is finally found via an unconstrained optimiza tion. This implicitly implies that, as 
opposed to some Bayesian approaches (|Dunson and Neelonl . I2003Q . the active set algorithm is not hurt 
if estimates of subsequent levels turn out to be equal. In Section |2] we show that the estimation of a 
regression function in generalized linear models (GLM) under the above ordered factor restriction can be 
easily performed using such an active set algorithm. 



Optimal scaling A reviewer drew our attention to optimal scaling, where one seeks to a ss ign numeri c 
values to categorical variabl es in some optimal way, see e.g. iBreiman an d Fried man! (119850: iGifil (119901): 



Hastie and Tibshirani 



Van Rosmalen et al 



(2009J). In 



Gifi 



(119901) . or applied to modeling interactions in 
(119901 . Section 2) categories of the original categorical variables are replaced by "category quantifica- 
tions", and from then on the variables are considered to be quantitative. Note that in the approach dis- 
cussed in this paper, one does not necessarily look for an optimal transformation, but rather imposes a 
priori knowledge on a given ordered predictor. In the example analyzed in Section [9] it seems plausible 
that a higher tumor or nodal stage is associated with a higher risk of experiencing a second primary tumor. 



Ordered predictors While the treatment of quantitative and grouped predictors in regression models 
is straightforward, we briefly review alternative approaches that can be applied to deal with an ordered 
explanatory variable z. Let us assume the levels of z are coded as 1, . . . , k where k > 2 and the levels are 
increasingly ordered, i.e. 1 < . . . < k. 

The most straightforward way to incorporate z as a predictor is simply to ignore the information about 
the groups and consider it a quantitative variable. This approach implicitly assumes that the group levels 
represent a true dimension, with intervals measured between adjacent categories that correspond to the 
chosen coding. If the ordinal values are arbitrarily assigned rather than actually measured, the regression 
coefficient is then difficult or impossible to interpret. 

Supposedly the most prevalent approach to incorporate an ordered predictor z in a regression model is to 
introduce k — 1 dummy variables Z2, ■ ■ ■ , Zk where z\ = l{z = i}, i = 2, . . . k. This approach ignores the 
additional knowledge of z having ordered levels, entailing that the estimated parameters fa, ■ ■ ■ ,(3k cor- 
responding to the above dummy variables may not be increasingly ordered. This is especially relevant in 
small sample studies, where noisy estimates may confuse the proper order of dummy variable coefficients. 
To simplify interpretation of models, especially when interactions are to be incorporated, researchers 
sometimes resort to dichotomizing a grouped factor, i.e. introducing only one dummy variable z\ = 
l{z < I}, for some 1 < I < k. Here, the additional k nowledge about th e ordered levels is not used and 
may cause a substantial loss of predictive information (ISteyerbergl . 120091. Section 9.1). 
Another choice may be polynomial contrasts. One then introduces new variables zi = i 2 {z = i},i = 
2, . . . , k. To avoid correlated estimators and therefore mutually dependent tests when doing variable 
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selection, researchers generally prefer to modif y the design matrix in order t o get orthogonal polynomial 



contrasts. The func t ion a s . ordered ( ) in R (|R Development Core Teaml. 120091) does this by default 



Gertheiss and Tuta (|2008h proposed a ridge-regression related approach to perform regression with or- 
dered factors. Consider the predictor z with ordered categories 1, . . . , k and the linear regression model 



A2Z2 + • 
Z/3 + e. 



+ AcZk + e 



(1) 



For simplicity, we do not consider an intercept and only one ordered factor. The vectors zj = (1{zi = 
jDjLj, j = 2, . . . , k are vectors of dummy variables corresponding to the levels of z, Z is the n x (h — 1) 
design matrix with the zj's as columns, y £ M" is the response and e an i.i.d. noise vector where Si ~ 
N(0,a 2 ). Note that for reasons of identifi ability, iGertheiss and Tutzl (I2008h assume /3\ = and therefore 
omit /3izi in (fl]). 



Instead of maximizing the original likelihood l{f3) over 6, IGertheiss and Tuta (12008b instead propose to 
maximize a penalized version of I: 



/C9)- A £03,-/3, 



j=2 



Here, A > is a tuning parameter. The solution to (0 can be explicitly computed as 

3gt = (Z T Z + An)- 1 Z T y 



(2) 



(3) 



for a fixed and specified matrix f2. The idea is that y is assumed to change slowly for adjacent categories, 
a property of /?gt that is "encouraged" by the shrinkage estimator ©. However, note that /3gt may still 
contain two adjacent estimates such that < ft+i, a somewhat undesired feature in this setting. 

Furthermore, if we choose j = 1 as our reference level (and therefore implicitly assume that f3\ = 0), it 
seems reasonable to demand for the estimated coefficients that they are all positive, what is not ensured 
by using ©. Finally, further considerations are necessary to determine the tuning parameter A. 
Consider Setting £T|) as before. In this paper, we introduce an algorithm to solve the following problem: 
Maximize t assuming that f3\ = and under the constraint that 



0</3 2 < ... 



(4) 



so that we receive non-negative and adequately ordered estimated parameters /?2 , • • • , Pk f° r tne factor 
levels. This approach is appealing since the available knowledge (or our "prior belief") is precisely 
exploited. Furthermore, constraining the space of allowed paramete rs can be interpreted as reg ularizing 
the estimator, implying higher accuracy of the constrained estimate (IDunson and Neelonl 120031) . This is 
especially relevant in small samples. As can be seen from ©, we can estimat e parameters for an ordered 
factor such that the constraints < /3 2 < . . . < /3fc are enforced, unlike in IGertheiss and Tuta (120081) 
where the violation of these inequalities is only penalized. In the latter approach the violation of the first 
of the above inequalities, the non-negativity constraint, is not even penalized. In addition, our estimator 
is fully automatic, i.e. no arbitrary choices such as the coding of levels, the determination of a cutoff to 
pool levels, or the selection of a tuning parameter (such as A above) or bandwidth are necessary. 
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Testing in order restricted models There is a vast literature on likelihood ratio testing in models under 
linear equality and inequality constraints. For a discus sion and further references on (exa ct) testing unde r 
restriction s in the ordi nary linear regression model see IPerlmanl (| 19691 ) , IWolakl (|1987l) and IShapird (| 19881 ) . 



Silvapulld (| 19941) and lFahrmeir and Klingerl (1 19941 ) generalize these results to generalized linear models, 



especially logistic and Cox regression. As can be seen from (fl4l) below, any likelihood ratio test (LRT) 
is constructed as the difference of the likelihoods at the unrestricted and the restricted maximizer of 
the (partial) log-likelihood fu nction, which entails that one needs an algorithm to compute the restricted 
maximizer. ISilvapulld (119941 . Section 4) describes an ad-hoc app roach to find the constrained estimators. 
However, his algorithm is non-standard and tedious to apply (ISilvapulld . 1 19941. p. 856). The active 
set algorithm described here is a general framework able to tackle general optimization problems under 
constraints and therefore able to compute the restricted estimators in the above mentioned tests very 
efficiently. This facilitates the application of LRTs in this type of problem. 



Statistical inference and asymptotics Typically, derivin g asymptotic properties of shape-constrained 



estimators is hard, but the starting poin t in all these problems (IGroeneboom et al. 



2001 



Balabdaoui and Wellner . 



20041 : iDiimbgen and Rufibachl l2009l) is a characterization of the estimator, since all the estimators are 



defined as maximizer of some rather involved function. The most prominent example of a theoretical 
treatment of a shape constrained estimator via its characterization is the greatest convex minorant that 
characterizes the estimator of a monotone density (IGrenanderj . 1 1956T) . In Section [6] we characterize the 
solution in our problem. Besides being the starting point for a more thorough analysis, a characterization 
also allows to check whether an algorithm actually delivers the correct solution. 



Our contribution We propose an active set algorithm to find estimators in GLMs with ordered predic- 
tors. The estimators strictly comply with the constraints and are found very efficiently, and in a finite 
number of steps. For identifiability reasons, most regression approaches assume that the coefficient cor- 
responding to the lowest level of an ordered factor is equal to 0. Our approach ensures that all coefficients 
corresponding to higher levels are in fact non-negative as well. In addition, neither the estimator nor 
the proposed algorithm needs a tuning parameter. Having an efficient algorithm at hand that provides 
restricted estimates facilitates the application of LRTs to check whether ordered predictors should be 
included in the model. In addition, we provide a characterization of the estimator. This serves (i) as 
a benchmark to verify that the algorithm indeed delivers the maximizer, (ii) gives some insight in the 
structure of the estimator and (iii) marks the starting point for a more thorough (asymptotic) analysis. 

Organization of the paper A general formulation of the problem is given in Section|2l Some examples 
of GLMs that illustrate our new approach are discussed in Section [3] A description of the active set 
algorithm adapted to our problem is given in Section|4l There exist special cases of the problem that allow 
one to find the linear regression estimator fi\ more easily than using the active set algorithm, discussed 
in Section [5j A characterization of the solutions is given in Section [6] Some indications on statistical 
inference are provided in Section [7] Literature on likelihood-ratio testing to check whether an ordered 
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factor should be included in the model is briefly discussed in Section[8] A real data example from oncology 
is analyzed in Section Finally, a more technical description of the algorithm and proofs are postponed 
to the Appendix. 

2 Setup 

We consider the general regression problem of modeling an outcome y G K based on some feature vector 
w G R p . Therefore, we are given a set (y,, {wij) p - =1 ) of observations, for i = 1, . . . , n. Write 

y = (yi)lliGR n and W = (wl)f =1 eR nxp 

where Wi. = (wij)P =1 , i = 1, . . . , n. The predictors are denoted by w.j = (wy)?^ for j = 1, . . . ,p. 
Throughout the exposition, n and p are considered to be fixed. 

In general, for given y and W, we seek to maximize a real-valued concave criterion function 

L = L(y, W, p) : R n x R nxp x R p R 

over j3 G RP, yielding an estimated parameter vector /3 G R p . Note that to define our estimator and 
to derive the characterization in Section [6j a model needs no further specification that goes beyond the 
function L. Ordinary, i.e. unordered, factors are assumed to be already coded as dummy variables, so 
they are considered quantitative. If an intercept is to be taken into the model, we simply assume it to be a 
quantitative variable of all l's. Let c denote the number of quantitative predictors and suppose that the last 
/ predictors w.j, j = c + 1, . . . ,p are ordered factors, each with kj levels (so c = p — /). Furthermore, 
the coding is assumed such that G {1, . . . , kj}, i = 1, . . . , n, where a higher number corresponds 
to a "higher" level of the ordered factor w.j. Introduce the sets of indices J cp = {c + 1, . . . ,p} and 
Cj = {2, . . . ,kj} for j G J c ^p. Clearly, the case c = (no quantitative variables in the model) is not 
excluded. However, we assume to have at least one ordered factor, i.e. / > 1 which immediately implies 
p > 1. In order to respect the ordinal character of each of the factors w.j we estimate (3 based on a new 
data matrix X G R d . This latter matrix is obtained via modifying the original data matrix W by adding 

/•((E **)-(p-<o) 

j=c+l 

dummy variables for the levels > 2 of the ordered factors. We then constrain optimization of the updated 
functional L = L(y, X, (3) to the constrained space of parameters 

B(c,p, k) = {/?GM d : /?j, 2 >0, ^j.i+i -/3j,i>0, 1 G £j, j G J c , p }. (5) 

Here, /3j i is the coefficient of the dummy variable corresponding to the level / of the j-th ordered factor, 
and k = ((0)? =1 , k c+ i, . . . , k p ) G RP. For ease of notation, we define B = £>(c, p, k). Constraining 
estimation to B ensures that the estimated parameter corresponding to a "higher" level of an ordered factor 
is at least as large as those of "lower" levels and all estimated parameters are non-negative. Note that our 
approach also adds something new if we have an ordered factor with only two levels (note that we always 
lose the level attributed to the baseline), namely that /3j ; 2 > for this ordered factor. 
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3 Examples 



We briefly specify the GLMs we provide algorithms for. Extensions to other criterion functions are 
straightforward. 

Linear regression Here, y € M n and we estimate j3 via maximizing the criterion function l n ^\ over all 
/3 E B. This latter function is defined as 

n 

lnA/3) = -5>,-X^) 2 . 
i=l 

Here, Xi. denotes the i-th row vector of X. We emphasize that given L there is no need to further specify 
a model for the data. 



Logistic regression In this case, y G {0, l} n . Using maximum likelihood estimation (MLE) we obtain 
the log-likelihood function 

n 

inM = -£(-MXil0 + log(l + exp(x,T /?)))• 
i=i 

Cox regression Here, we have observations (Tj, Cj, 5$, for i = 1, . . . , n. Clearly, Tj are the failure 
times (possibly unobserved), Cj the censoring times, <5j = 1 {event has happened} and Xi. is the feature 
vector as before. If we introduce the observed time V{ = min{T,-, Cj} for each unit, let 

Ri = {j ■ Vj > Ti} 

denote the number of individuals at risk after time Ti, i = 1, . . . ,rt. The partial likelihood according to 



Coxl (U972T) is then 

exp(x^) 



n[ 



i L E fee ^exp(x k ^) 

Introducing a>i = exp(x^/3) for i = 1, . . . , n and letting t\ < . . . < to be the observed (assumed to be 
distinct, for simplicity) event times, we then easily deduce the log-likelihood function: 



n n 

4,3(/3) = J>log(^a fc ) 

i=l i=l k£Ri 

D D 



s=X s=l k&R s 

where ar s \ is the above expression belonging to the s-th failure time, s = 1, . . . , D. 
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Properties of the maximization problems Let us introduce the constrained 



Pi := maximize £ n> j(/3), i = 1, 2, 3 



(6) 



and the unconstrained 



rfi := maximize i n i(P), i = l,2,3 
/3eK d 

maximizers. The conditions on fixed response y and design matrix X under which rj\ exist and ar e 
unique in logistic regre s sion a re well studied (lAlbert and Anderson! . 1 19841 : ISantner and Duffyl 1 1986T) . 
Silvapulle and Burridgd (119861) specify necessary and sufficient conditions for the MLE to exist in lo- 
gistic and Cox regression. Since the set B is a closed convex cone, the estimators Pi exist and are unique 
for i = 1, 2, 3 at least under the same condition s as those for rj-,. Conditions for consistency and asymptotic 
normality of MLEs in GLMs are provided in Fahrmeir and Kaufmannl (11985b . In this paper we assume 
that our design matrix X is such that t n ^ is concave and coercive for % = 1, 2, 3. 



4 Active set algorithm to compute (3i 



In Fletcher! dl987h an active set algorithm is described, a useful tool for constrai ned optimization p rob- 
lems. In connection with likelihood ratio tests (see Section [8]) we came across ISilvapulld (I1994T) . In 
Section 4 of this latter paper, it seems as if a version of the active set algorithm is described. However, 
instead of directly computing the "active set" in each iteration (see below), a crude and computationally 
expensive "all-subset sea rch" is proposed. In the context of mixture models, the algorithm discussed by 



Groeneboom et al. 



( 200 8J) can also be interpreted as a valiant of an active set algorithm. 



In Section 3 of iDiimbgen et all (120071) the general principle of active set algorithms is described in detail, 
complemented by a discussion of its validity. Here, we therefore limit ourselves to the discussion of the 
main features and points relevant for the application of the active set algorithm to find the Pi's. We briefly 
sketch the idea of an active set algorithm, and refer to|A]for a detailed technical exposition of the algorithm 
for the problem treated here. 

Let q denote the number of constraints that compose £>, I the function to be maximized and p its maxi- 
mizer, see Section [3] Define for any index set A C {1, . . . , qj the linear subspace 

V(A) = j/?GlR d : -/3 ji i + /3 ji i_il{l>3} = 0, for all j, 1 such that </>(j, 1) € a}. 

The function 4> maps the indices j and I of the dummy variables forming the ordered factors to the number 
of constraining inequalities, see ( fT5b inlAl The crucial assumption for an active set algorithm is that we 
have another algorithm available that for any A C {1, . . . , q} (efficiently) computes 



/3(A) 



arg max£(/3), 



provided that V{A) n {P : t{P) > — oo} ^ 0. Subspaces of the parameter space are considered when 
violations of the initial constraints appear in the algorithm. In this case, the active set algorithm varies 
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A in a deterministic way, until finally (3 (A.) = f3. In order to tailor an active set algorithm to a specific 
problem, the above maximization on a subspace is crucial. In our regression with ordered covariates 
setting, we show in [A] (see Table |4]) that three types of subspaces have to be dealt with, depending on the 
specific violation that occurs. 

It is important to realize that by design, the main routine of an active-set algorithm does not need a stop- 
ping criterion as e.g. Newton-type algorithms. Once the algorithm has identified the set A that corresponds 
to the solution (3, it performs an unrestricted maximization (here, a stopping criterion may be necessary), 
which at least in the linear, logistic and Cox regression examples is unproblematic. Ver ification that a 



given /3 is the maximizer can be done by means of Theorem 3.1 in lDiimbgen et all (120070 . Additionally, 



since there are only finitely many subsets of A, the algorithm terminates after finitely many steps. 

5 Special case: An almost explicit solution 

To be able to state the following results concisely, let us introduce for every ordered factor j £ J7" c p the 
set of indices where the equality constraint /3jj > is active: 

Zj(fk) = [lE{2,...,k j ] : Pj,i = 0} 

for all j G J C:P . 

In this section, we restrict our attention to the case of linear regression with only one ordinal predictor. If 
in addition Z\{fi\) = 0, that means the constrained estimator has only strictly positive entries anyway, 
then simplifies such that /3\ can be found via solving ©. 

Lemma 5.1. Ifc = 0,f = l and Zi{f3\) = 0, the estimator f3\ is 

3i = argmax £ n ^(/3) 
/3 2 >...>/3 fel 

fci 

= argmin Nj(f3j — mj) 2 (7) 
P2<...<0 kl 

where for I € C\ 

n 

Ni = }^ l{xu = 1} and mi = iV/ _1 Vi, 

i=l i:xu=l 

The proof of this lemma is postponed to |B] 



The solution to (0 can easily be computed using the PAVA (IBarlow et al.[ 1 1972c Robertson et al.L Il988h . 
This latter algorithm performs at most n — 1 iterations until the vector /3x is found. 
One of the initial motivations to analyze regression with ordered predictors, and the reason why we in- 
cluded this very specific example, was to see whether this simple and appealing structure can be carried 
forward to the more general problem of more ordered factors and additional quantitative variables. How- 
ever, since (i) we were not able to construct a generalized PAVA algorithm that solves our problem and 
(ii) we are not only interested in the least-squares problem but also treat GLMs, we switched to an active 
set algorithm. 
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6 Characterization of the solution 



There are two main purposes of providing a characterization of the estimator j3: (i) knowing the structure 
of the maximizer of £ allows one to cross-check the validity of the proposed active-set algorithm and 
to check whether it has found the correct maximizer of I. (ii) It is well-known that in such constrained 
estimation problems, the key to deriving asymptotic properties of the estimator such as consistency or rate 
of convergence is a characterization in terms of directional derivatives, see the discussion in SectionQ] 
To be able to state the following theorem properly, we introduce the function if) : { (c + 1) x C c+ \ , . . . , p x 
£p} — > {1, . . . , d} that maps the original indices (j, I) to the column number of the respective dummy 
variable in X, or equivalently, to the index i that corresponds to the entry of the vector f3 G B(c, p, k) 
that corresponds to /3y \. Specifically, this function is for any j G J CyV and / G Cj, 

j 

h=c+l 
j 

= 2c+ ( k h ^+l-j. (8) 

h=c+l 

By we denote the inverse of this function, i.e. the function that maps the position i of the entry of /3 
to the indices j and I. Now, for each j £ J c ^ v let hj be the vector of distinctive strictly positive values of 
(/3j)j g £ j for every j 6 J CiP and any j3 G £>. Using these definitions we split any vector /3 G B into the 
following blocks: 

Bi(/3) = {i : i = 1, . . . , c} (coefficients of quantitative variables), 

B 2 M = {i : ^-i W = and (^W)! = j}> 

B3,j, u (P) = {all indices i s.t. = h jiU for = j}, 

where u = 1, . . . , [hj | for each j. Here, | . | denotes the dimension of a vector a or the number of elements 
in a set. Note that \B\ \ + | Uj B 2 j \ + | Uj >M B^j :U \ = d. Using these blocks, we are now able to formulate 
the characterization of the solution. 

Theorem 6.1. An arbitrary vector 7 G B(c, p, k) maximizes the concave function t if and only if it fulfills 
the following conditions: 

(W(7)) = for all s G -81(7) (9) 

t 

( V ^)) s ^ 0, for allteB 3tjtU tf), u = l,...,\hj\ and jeJ CjP , (10) 

s=min B3,j,u(7) 
max_B 3j> (7) 

( W ^)) s ^ 0, forallt€B 3dtU (%u = l,...,\h J \andj€j e , p . (11) 
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Var 


Level 


P 




V771 


pi 


Vpi 


V-i-pi 








quant 




2 


2.13 





2.12 








2.08 








quant 




-3 


-2.95 





-2.94 








-2.96 








quant 







0.19 





0.18 








0.17 








factl 


2 


1 


1.06 





1.08 








0.88 








factl 


3 


1 


1.53 





1.41 








1.23 








ordl 


2 





-0.85 





-0.78 











-26.86 


-26.86 


ordl 


3 


2 


3.55 





1.99 


79.8 


79.8 


2.19 


79.23 


52.38 


ordl 


4 


2 


1.67 





1.99 


-13.57 


66.23 


2.19 


-12.66 


39.71 


ordl 


5 


2 


0.60 





1.99 


-65.85 


0.39 


2.19 


-65.3 


-25.59 


ordl 


6 


2 


1.94 





1.99 


-0.39 





2.19 


-1.27 


-26.86 


ordl 


7 


5 


4.41 





4.47 








4.65 





-26.86 


ordl 


8 


5 


4.55 





4.60 








4.79 





-26.86 



Table 1: Estimators, gradients and cumulative sum of gradients for Example 1. 771 is the unconstrained 
estimate, pi the constrained version without the non-negativity restriction, and f}\ the restricted and non- 
negative estimate. 

Note that the entries of the gradient at the active constraints i £ Bi,j, w& not needed to characterize 
the solution since 7 always equals at these positions. Furthermore, the theorem immediately implies 

J2 ( v ^)) s = (12) 

for u = 1, . . . , |hj I and j G Cj. 

To illustrate Theorem 16.11 consider the following example: For n = 200 observations we generated 
a dataset with standard normally distributed errors, three quantitative variables, one (unordered) factor 
(with three levels) and one ordered factor (with eight levels). The model we stipulated to generate the 
response y was 

yi = 2q u - 3q 2 i + 0q 3i + 0f u + / 2i + / 3 ; + 0o u + 0o 2i + 2o 3i + 2o 4i + 2o 5i + 2o 6i + 5o 7i + 5o 8i + e« 

where qji ~ iV(l, 2) for j = 1, 2, 3 and i = 1, . . . ,n = 200, each level of any factor (whether ordered 
or unordered) has the same number of observations and these are randomly allocated to the observations. 
Finally, ej ~ iV(0, 4) for i = 1, . . . , n. The resulting (constrained) linear regression estimates are given in 
Tabled] Note that for comparison we also added columns for the estimator pi which is computed similarly 
to Pi, but without the positivity restriction /9g,2 > 0. For this estimator, a characterization similar to that 
in Theorem |6T] can be given using exactly the same approach. 

In this example, we get the following quantities: p = 6, / = 1, c = 5, d = 12, ^5,6 = {6}, C% = 
{2,..., 8}, fc 6 = 8, and finally 

5(5,6,8) = {/3GM 12 : /3 6 ,2>0, /3 6)1+1 >/3 6jl , 1G{2,...,7}}. 
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The notation V-j-v in Table[T]is shorthand for the cumulative sum of any vector v G 



Vtv 



(£*) 



d 

k=V 



i=l 



The values of the least-squares criterion function for the three estimates are 



4,i (m 



-2964.8 £ n ,i(pi) 



-3074.8 



-3085.3 



andZ 6 (/3 1 ) = {2}. 

Let us now illustrate Theorem 16. li For either quantitative variables or dummy variables corresponding to 
unordered factors (which in our context are conceptually equivalent), the respective entry of the gradient 
V/?i is always 0. As for the ordered factor, for the entries where the positivity constraint is active (i.e. the 
elements in Zq((3 1 )), the gradient has a value which is not used (and not necessary) for a characterization 
of Pi. The sets denned above are for the simulated example: 



Bi >fl = {1,...,5} £ 2 , 6 = {6} £ 3 ,6,i 
£3,6,2 = {11} £3,6,3 = {12} h 6 



= {7,..., 10} 

= (2.19,4.65,4.79). 



These sets then yield the following inequalities, according to ([TOl and (fTTT> : 



W n ,i(/3i)J s = for sG {1,...,5} (W n ,i(/3i)J s = for s = 6 
t 10 
£(V4,l(A)) > for t G {7, . . . , 10} ^(V4,i(ft)) < for t G {7, . 



s=7 



,10} 



(V4,i(ft)) s = for sG {11,12}. 



7 Statistical inference 



Having shown how to compute estimators for i = 1,2,3, the question arises how to perform (fre- 
quentist) statistical inference in these models. Deriving consistency, rate of convergence and limiting 
distributions for estimators similar to /3j under standard assumptions is known to be non-trivial. It is 
therefore not clear how to construct e.g. confidence intervals for our estimated parameters of the ordered 
factor. By using the characterization given in Section [6j one should be able to derive rates of convergence 
and even the limiting distr ibuti on of (3 as n — > 00 in a suitably specified model, thereby generalizing 



the results of iBrunkl (119701) and IWrighn (119811) for isotonic regression to our more general setting. This, 
together with a generalization of the likelihood ratio tests introduced in Section [8] to an arbitrary number 
of ordered factors, is subject to ongoing research. 



Note t hat bootstrap is not without fallacies in these type of models, see iKosorokl (120081) and 
J2004 



Sen et al. 
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8 Testing for the presence of constraints 



There is a vast literature on likelihood ratio testing in models under linear equality and inequality con- 
straints. For a discussion and furth e r refe r ences o n (exac t) te sting under res tri ctions in t h e ord inary 
linear regression model see IPerlmanl (119691) . IWolaM (119871) and IShapirol (119881) . ISilvapulld (119941) and 
Fahrmeir and Klingen (119941) generalize these results to generalized linear models, especially logistic and 
Cox regression. Suppose a researcher wants to test the following hypotheses: 



Hq '■ Pc+1,2 



c+l,k, 



c+1 



vs. H x : (3 e B(c,c + l,kc + i) 



(13) 



Note that the estimator under Hq can be computed via an unrestricted maximization. It conesponds to a 
maximization using amodified design matrix X with the columns ip(c+l, 2), . . . , ip(c+l, k c+ \) omitted. 
Since under Hq we need to consider an unrestricted estimator, we have to constrain attention either to (i) 
only one ordered factor or (ii) a test of inclusion of all ordered factors against their entire exclusion from 
the model. The poten tial influence of th e additional ordered factor(s) on the response is assessed with H±. 
In notation similar to 



Silvapulld (|l994l) . the above hypotheses translate to 



Hq : R/3 = 



vs. 



Hi : R 2 /3 > 0, 



where here R = R 2 is the /c c+1 x d matrix chosen such that 



R 2 /3 



((0)Li, A, A c+1 - flwi-i) 



Following the development in lSilvapulld (|1994h . the likelihood ratio test statistic to test the Hypotheses 
(fT3l is defined as 



2W)-£(rjf)). 



(14) 



The distribution of Tlr is a mixture of y 2 distrib utions. The weights are in principle fully specified, 
however, in general ha rd to compute (IWolakl.ll987h . As a remedy, one can ei t her us e exact Monte Carlo 
weights (IWolakl.ll987l) or bounds on the p- value for the above test (ISilvapulld . 1 19941 . Proposition 1). 
As can be seen from (fT4l) any LRT is constructed as the difference of the likelihoods at the unrestricted and 
the restricted maximizer of the (partial ) log-likelihood f unction, which entails that one needs an algorithm 
to compute the restricted maximizer. ISilvapulld (119941 . Section 4) describes an ad- hoc approach to fi nd 
constrained estimators. However, his algorithm is non-standard and tedious to apply (ISilvapulld . 1 19941 . p. 
856). The active set algorithm described here is a general framework able to tackle general optimization 
problems under constraints and able to compute the restricted estimators in the above mentioned tests 
very efficiently. 



9 A real data example 



We illustrate our new algorithm using a data set from oncology, initially analyzed in lTaussky et all (120051) . 
The goal of the study was to assess the impact of treatment- and patient-related factors on the risk of 
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Variable 



Type 



Levels (first mentioned = baseline) 



Intercept (inter) 
Age (age) 
Treatment (tmt) 
Radiotherapy (rt) 
Sex (sex) 
Tumor stage (t) 
Nodal stage (n) 
Performance status (ps) 



constant 

continuous (standardized) 

factor 

factor 

factor 

ordered factor 
ordered factor 
ordered factor 



Chemotherapy (CT) yes, CT no 

concomitant boost (CB), hyperfractionation (HF) 

female, male 

1< 2 < 3 < 4 

1<2<3<4<5<6 

1 < 2 <"stage greater than 2" 



Table 2: Explanatory variables in real data example. 



developing a second primary tumor (SPT) of the upper aerodigestive tract within three years after initial 
therapy, in head-and-neck cancer patients. For a subset of 23 1 patients that had either been observed at 
least three years without SPT or experienced an SPT before three years, the endpoint 

SPT3 = l{The patient experienced a SPT at 3 years or before} 

was defined and modeled using multiple logistic regression. The explanatory variables are described in 
Table El 

Researchers assume in general that higher tumor stage, nodal stage, and performance status correspond 
to a higher risk of experiencing a SPT. It seems therefore appropriate to use our constrained estimator in 
this setting. In Figure [T] the unconstrained and constrained estimators 772 and fa are displayed (dot and 
triangle, respectively) as well as profile likelihood confidence intervals for rj (a = 0.05). Values of the 
likelihoods were 4,2(^2) = -101.6 and £ n ,2(fa) = -102.2. 

Estimates for quantitative predictors, i.e. those for age, treatment, radiotherapy and sex turned out to be 
very similar for 772 and fa. On the other hand, the "prior belief" or assumption of non-negative and 
increasing estimates for the levels of the ordered factors tumor and nodal stage and performance status 
was violated by the unc onstrained estimator 772 and "corrected" by fa. 



The original analysis in lTaussky et al.l(|2005l) focused on identifying factors that influence the occurrence 
of SPT. Variables were not taken into account as ordered factors, but were dichotomized. For comparison, 
we also computed the restricted and unrestricted estimates in this setting, see Table [3] It turns out that 
parameter estimates and corresponding odds ratios (OR) for the two approaches were similar, except for 
the nodal status. Note that the effect of tumor stage is reversed, compared to the case where we consider 
all factor levels (and do not only dichotomize), compare Figure [T] 
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Variable 



Type 



Levels 



771 OR ,Si OR 



Intercept 


constant 




-2.56 




-2.72 




Age 


factor 


< 57, > 57 


-0.19 


0.83 


-0.20 


0.82 


Treatment 


factor 


CT yes, CT no 


0.41 


1.51 


0.42 


1.53 


Radiotherapy 


factor 


CB, HF 


1.03 


2.81 


0.99 


2.70 


Sex 


factor 


female, male 


0.51 


1.67 


0.49 


1.63 


Tumor stage 


ordered factor 


1, > 1 


-0.21 


0.81 


0.00 


1.00 


Nodal stage 


ordered factor 


0, > 


0.26 


1.29 


0.27 


1.31 


Performance status 


ordered factor 


0, > 


0.37 


1.44 


0.40 


1.49 



Table 3: Explanatory variables in real data example, dichotomized variables as in original paper. 



LU U 



— •— Unconstrained MLE f| 2 

A 

A Constrained MLE p 2 



I I I I I I I I I I I I I I I 

inter age tmt.ct rt.hyp sex.m t.2 t.3 t.4 n.2 n.3 n.4 n.5 n.6 ps.2 ps.>2 



Figure 1: Estimates and confidence intervals for SPT example. 



10 Extensions 



It is straightforward to generalize the set B(c, p, k) to 

B'(c, p, k,r) = {/5£l d : /3 ji2 > r jj2 , /3 jjl+1 - ^ > r 1+1 , 1 G £j \ {kj},j G J c>p } 

for arbitrary real numbers /. Using such a more general parameter space could be beneficial in connec- 
tion with finding the minimu m effective dose in dos e-response models. The dose levels would then take 
the role of an ordered factor (IWang and PengJ. 120071) . Our new approach easily allows us to incorporate 
further predictors of any of the three types described in the introduction to model the response. 
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Modeling a factor with decreasing levels can be achieved by reversing the coding of the corresponding 
ordered factor, using the algorithm under the constraint of increasing levels and finally re-reversing the 
order of the estimates in the vector /3. Using this approach, it is straightforward to find the solutions for all 
combinations of possible orderings of, say, three ordered factors. By computing the value of the criterion 
function for all these resulting coefficient vectors, one can find the one with the lowes t criterion value, an 



approa ch related to finding a global maximum in the criterion function described in Ivan der Kooij et al 

(boodi 

Generalizations to further criterion functions, such as other GLMs or least absolute deviation regression 
with ordered covariates, are straightforward. As for the latter problem, we suggest s moothly approximat 



ing th e not everywhere differentiable criterion function, as previously discussed in iBeran and Dumb gen 

J2004 
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A Details of the active set algorithm 

In this section, we complement the description of the algorithm indicated in Section |2] Recall the sets of 
indices J CjP = {c + 1, . . . ,p} and Cj = {2, . . . , kj} for j G J CtP . 

In order to respect the ordinal character of each of the factors w.j we introduced in Section|2]the new data 
matrix X by adding dummy variables for the ordered factors, such that 



X 



W.i, . . . , W. c , X.^jj) 



l6£j; j€jc,p 

for dummy variables 

x ^(j,i) = (H w ij = ^})f=i> le£j, j e J c ,p. 

The function tp is given in (HJ. With the above version of coding, / = 1 is considered the reference level 
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for every ordered factor w.j and the resulting design matrix X is now an element of M. nxd where 

d = EE 1 

= c + ip{p,k p ) - ip(c+ 1,2) + 1 
= c-f+ ]T k r 

Again, we denote by Xi. the i-th row of X, i.e. the values of the "dummyfied" predictors for the i- 
th observation. In order to respect the ordinal character of each of the factors w.j we then constrain 
optimization of the updated functional L = L(y, X, j3) to the space of parameters B(c, p, k) given in (f5]). 
We write t as placeholder for any of the functions £ n ,i,^n,2, or 4i,3 (for ease of notation we omit the 
dependence on n) and the aim is to find for given response vector and matrix of predictors the vector 

/3 := argmax^(/3). 



To fit the constrained maximization problem © into the framework of iDiimbgen et al.l (120071) . we write 
the set B given in (f5]) as 

B = {P e R d : vj p < 0, i = 1, . . . , q} 
for vectors Vj 6 BL d . For ease of notation, we have enumerated the constraining inequalities 

vj> = -Pc+1,2 < 

vlp = -/3 c+ l,3 + /3 c +l,2 < 



V k c + 1 -lP - -Pc+l,k c +l + &+l,fcc+l-l ^ 

V kc+1 /? = -Pc + 2,2 < 

Vke+^l/ 3 = -/3 C +2,3+/?c+2,2 < 

V q/? = -Pp,k p +Pp,k p . 1 < 

from i = 1, . . . , q, where 

jeJcp 
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The function : {(c + 1) x £ c +i, . 
"inequality index" i is given by 



,p x £ p j — > {1, . . . , q} that maps the original indices (j, I) to the 



<P(j,l) = ( £ kh-t) +(/-l)-(j-c-l) 

h.=c+l 

= ^(j,0-c (15) 
so that the inequalities can be written as 

v 0(j,l)^ = + $J,/-1 1 {Z>3} 

< 

for / € £ p and j £ J CtP . The vectors vi for any i = <p(j, I) £ {1, ... ,q} are received via 
Vi := (l{k = c + 4>{j,l)-l}l{l>Z}-l{k = c+<P{j,l)}) q . 

\ / k=l 

Note that all these vectors are linearly independent. Define for any index set A C {1, . . . , q} the linear 
subspace 

V{A) := |/3eM d : = 0, for all a €A} 

= {/3eR d : - / 5 j) i + ) S j) i_ 1 l{l>3} = 0, for all j, 1 such that ^(j, 1) 6 a} 

and for (3 £ M d the set A of "active constraints": 

A(fi) := {ie{l,...,q} : > o}. 

Maximization on subspace The crucial assumption for an active set algorithm is that we have an algo- 
rithm available that for any A C {1, . . . , g} (efficiently) computes 

/3(A) = arg max l((3), 

/3GV(A) 

provided that V(A)Pi{f3 : £(f3) > — oo} ^ 0, see Section|4] For simplicity and without loss of generality, 
fix j = c + 1. Then, for a given /3 the following situations can cause a non-empty set V(A): 

Case Violation(s) A(f3) Corresponding set V(A) 

1 /9 c+ i,3>j8c+i,a,j8 c +i,a<0 {1} {/3 G R d : Vi /3 = 0} 

2 /3 c+ i, 2 > j8c+i,3, Avt-1,2 >0 {2} {/3GR d : V2/3 = 0} 

3 /3 c+ i,2 > /3 c +i,3, /3c+i,2 < {1,2} {/3GR d : Vi/3 = 0, V2/? = 0} 
Table 4: Possible violations of constraints within one ordered factor. 

Note that the situation v s /3 T > for any s = 3, . . . , k c+ \ can be treated analogously to Case 2 in Table|4] 
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To compute the unrestricted maximizer /3(A) in the three cases given in Tabled the strategy is to suitably 
modify the design matrix X. Precisely, we show how to construct new data matrices X' and a new 
corresponding function £\ , i = 1, 2, 3 : M. d * — > R (here, i stands for the corresponding case in Table |4} for 
a given A* C {1, . . . , q} in the three cases of Table|4]such that /3(A*) can be immediately derived from 

& = argmax^(/3). (16) 

/3eR d * 

It is crucial to realize that the maximization in ( fT6l ) is unconstrained and the following arguments show 
that d\ < d in all considered cases. In what follows, we explicitly state the unconstrained maximization 
problem, assuming that only the case under consideration is present. Apparent combinations of these 
basic strategies are necessary in case more than one of the three cases described in Table @] are present. 



Case 1 Writing down the maximization problem ([TBI explicitly, we get 



/3({1}) = argmax £(/3) 

/3c+l,2=0, 



= ((#)f =1) o, (dDf-^i) 

with 

M = argmax^(/3,X_ (c+1) ), 



where in general M_i is the matrix M with the i-th column omitted and Q) is the criterion function 
corresponding to I, but based on the design matrix Q. 



Case 2 Roughly, the strategy here is to add up the dummy variables corresponding to the violating 
constraints, compute the unconstrained maximizer and then "blow up" the resulting estimator again. To 
see this, consider 

/3({2}) = argmax £(/3) 

/3 c +l,3=/3c+l,2, 



((3*)i=l> A* c+l) c+1j 



c+2 



with 



= argmax^(/3, ((X) 1 ; i=1 , X. ; (c+1) + X. ; (c+2) , (X)^^ 



Case 3 Repeating the above computations, we derive 



/3({1, 2}) = argmax £(/3) 

/3 c+ l,3=/3c+l,2=0,/ 



63?)? =1 , 0,0, (g)f- c 2 +1 ) 



19 



where 

fil = argmax^(/3,X_ (c+ljC+2) ). 



B Proofs 

Proof of Lemma [57TI First, observe that for i = 1, . . . , n, 

l{wn = q}l{wn = r} = for 2 < q, r < k\ with q ^ r. 
The function —l n ,\ can then be written as 

fel . 2 



i=l 1=2 
n k\ k\ 

= J2(yf - 2 Vi J^^l{x u = 1} + (X>l{za = ^ 

i=l «=2 Z=2 

n fci n fci n 

= E - 2 E a E = x > + E a 2 E ^ = i} 

i=l Z=2 i=l Z=2 i=l 

fci n 

= £(/3?iV,-2A ^ y,)+E^ 2 

Z=2 i:xn = l i=l 

k\ n 

1=2 i:xn=l i=l 

k± n 

= ;>>((A-mo 2 -™f) + E^ 

Z=2 i=l 
= JVj^j - m z ) 2 + const(y, X). 

1=2 

The minimum of the latter expression under the constraint fa < • • • < can easily be found using 
PAVA. □ 

Proof of Theorem 16.1 1 Before coming to the actual proof, we state a necessary lemma. 

Lemma B.l. Let a, b G M n be two vectors having the following properties: 

j 

> for all j = l,...,n (17) 

i=l 
n 

E a * - forall k=l,...,n (18) 

i=fc 

h > bi-x for all i = 2, . . . , n. (19) 
20 



Then 

n 

aA < 0. 

i=l 

First, we prove that if 71 maximizes £ over B, then (l9l)-([T0l) are fulfilled. To this end, let t > be small 
enough and let A G M. d be a vector such that 71 + t A G B. Since 71 maximizes the concave function I 
we have 

^(7i+tA)| t=0 < 0, 

which entails 

W(ti) T A < 0. (20) 
We then get (l9l)-(fTTT) using the following perturbation functions: 

Ai = A!(c) = ±(l{s < c})f =1 , 
A 2 = A 2 (j,hj,t) = -{l{s = B 1 ,...,t}' 1 



A 3 = A 3 (j,hj,t) = (l{s = t,...,B 3 }^ 

for all t G ^3j>(7), j € i7c,p, and u = 1, . . . , |hj| and where we defined B3 = max B% t j u (7) and 
£>3 = minS3j )U (7). Now suppose we are given a vector 72 that fulfills (l9t-([TTI). We then have to show 
that 

72 = argmax ■£(/?). 
/3eB 

From convex analysis, it is well known that this is equivalent to show 

Um 1(72 + t(g- 72)) -1(72) Um 1(72 +tA)- 1(72) 

t\0 t t\0 t 

= V£(%) T A (21) 
< (22) 

for arbitrary vectors A = g — 72 such that g G B. Now compute 

V^(7 2 ) T A = ]TW(72) T (g-72) 

te 

M 

= EE E (WW(72))S - (72)s(V^(72))s) 
N |hj| 

= E E E 9 S (W(72)) S - £ £(72). £ ( W (72))s- (23) 
j'6Jc,p «=i seB 3 j iU je jc, P u=i s6B 3 JjU 

The second term disappears due to (fl2l) . As for the first term, we invoke Lemma IbTTI where ^(72) takes 
the role of a and g that of b to finally deduce that (l23l) is at most 0. □ 
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1=1 



Proof of Lemma [B7T1 First, note that (fTTT ) and (TP8l immediately imply 

n 

= 0. 

Using this, one deduces 

n n 

^aibi = y~]ai(bi - h) 



i=l 



i=2 
n-l 



(y^aijbj - bi) 

i=2 
n-l 

^a^bi - 61) 

i=2 
n-2 

= (y^ y ai(bj - 61) 

i=2 
n-2 

i=2 
n— 3 



- 01 



i=2 



+ a«(^n - 61) 

+ a n (6 n _i — &i) since a n < and due to ( fl9l ) 

+ (a„_i + a n )(b n -i - bi) 

+ (o n _i + a n ){b n - 2 ~ bi) due to CU) and (QJ 

+ (On-2 + £»n-l + «n)(On-2 ~ h). 



Repeatedly applying this same trick we finally arrive at 

n n 

^aibi = 02(62 ~ pi) + f ^ a>i) (h ~ h) 

i=l i=3 
n 

< (X)oi)(&2-6i)- 

i=2 

By means of (TPST ) and ( fl9l ) the latter expression remains non-positive. □ 

References 

Albert, A. and Anderson, J. A. (1984). On the existence of maximum likelihood estimates in logistic 
regression models. Biometrika 71 1-10. 

BACCHETTI, P. (1989). Additive isotonic models. J. Amer. Statist. Assoc. 84 289-294. 

Balabdaoui, F., Rufibach, K. and Santambrogio, F. (2009). Least squares estimation of two 
ordered monotone regression curves. J. Nonparametr. Stat., to appear . 



22 



BALABDAOUI, F. and Wellner, J. (2004). Estimation of a /c-monotone den- 

sity, part 2: algorithms for computation and numerical results. Tech. rep., Tech- 
nical report 460, Department of Statistics, University of Washington. Available at 
http ://w ww. stat. washington.edu/www/research/reports/2004/tr460.pdf. 

Barlow, R. E., Bartholomew, D. J., Bremner, J. M. and Brunk, H. D. (1972). Statistical 
inference under order restrictions. The theory and application of isotonic regression. John Wiley & 
Sons, London-New York-Sydney. Wiley Series in Probability and Mathematical Statistics. 

Beran, R. and Dumbgen, L. (2009). Least squares and shrinkage estimation under bimonotonicity 
constraints. Statistics and Computing to appear. 

Breiman, L. and Friedman, J. H. (1985). Estimating optimal transformations for multiple regression 
and conelation. J. Amer. Statist. Assoc. 80 580-619. With discussion and with a reply by the authors. 

Brunk, H. D. (1970). Estimation of isotonic regression. In Nonparametric Techniques in Statistical 
Inference (Proc. Sympos., Indiana Univ., Bloomington, Ind., 1969). Cambridge Univ. Press, London, 
177-197. 

Cheng, G. (2009). Semiparametric additive isotonic regression. J. Statist. Plann. Inference 139 1980- 
1991. 

Cox, D. R. (1972). Regression models and life-tables. /. Roy. Statist. Soc. Ser. B 34 187-220. With 
discussion by F. Downton, Richard Peto, D. J. Bartholomew, D. V. Lindley, P. W. Glassborow, D. E. 
Barton, Susannah Howard, B. Benjamin, John J. Gait, L. D. Meshalkin, A. R. Kagan, M. Zelen, R. E. 
Barlow, Jack Kalbfleisch, R. L. Prentice and Norman Breslow, and a reply by D. R. Cox. 

Dumbgen, L., Husler, A. and Rufibach, K. (2007). Active set and EM algorithms for log- 
concave densities based on complete and censored data. Tech. rep., University of Bern. Available 
at arXiv:0707.4643. 

Dumbgen, L. and Rufibach, K. (2009). Maximum likelihood estimation of a log-concave density and 
its distribution function. Bernoulli 15 40-68. 

DUNSON, D. B. and HERRING, A. H. (2003). Bayesian inferences in the Cox model for order-restricted 
hypotheses. Biometrics 59 916-923. 

DUNSON, D. B. and Neelon, B. (2003). Bayesian inference on order-constrained parameters in gener- 
alized linear models. Biometrics 59 286-295. 

Dykstra, R. L. and Robertson, T. (1982). An algorithm for isotonic regression for two or more 
independent variables. Ann. Statist. 10 708-716. 

Fahrmeir, L. and Kaufmann, H. (1985). Consistency and asymptotic normality of the maximum 
likelihood estimator in generalized linear models. Ann. Statist. 13 342-368. 



23 



Fahrmeir, L. and Klinger, J. (1994). Estimating and testing generalized linear models under inequal- 
ity restrictions. Statist. Papers 35 211-229. 

Fletcher, R. (1987). Practical methods of optimization. 2nd ed. A Wiley-Interscience Publication, 
John Wiley & Sons Ltd., Chichester. 

Gelfand, A. E., Smith, A. F. M. and Lee, T.-M. (1992). Bayesian analysis of constrained parameter 
and truncated data problems using Gibbs sampling. J. Amer. Statist. Assoc. 87 523-532. 

GERTHEISS, J. and Tutz, G. (2008). Penalized regression with ordinal predictors. Tech. Rep. 15, 
Ludwig-Maximilians-University, Munich. 

GHOSH, D. (2007). Incorporating monotonicity into the evaluation of a biomarker. Biostatistics 8 402- 
413. 

GlFl, A. (1990). Nonlinear multivariate analysis. Wiley Series in Probability and Mathematical Statistics: 
Probability and Mathematical Statistics, John Wiley & Sons Ltd., Chichester. With a foreword by Jan 
de Leeuw, Edited and with a preface by Willem Heiser, Jacqueline Meulman and Gerda van den Berg. 

Grenander, U. (1956). On the theory of mortality measurement. II. Skand. Aktuarietidskr. 39 125-153 
(1957). 

Groeneboom, P., Jongbloed, G. and Wellner, J. A. (2001). Estimation of a convex function: 
characterizations and asymptotic theory. Ann. Statist. 29 1653-1698. 

Groeneboom, P., Jongbloed, G. and Wellner, J. A. (2008). The support reduction algorithm for 
computing non-parametric function estimates in mixture models. Scand. J. Statist. 35 385-399. 

HASTIE, T. J. and TlBSHlRANl, R. J. (1990). Generalized additive models, vol. 43 of Monographs on 
Statistics and Applied Probability. Chapman and Hall Ltd., London. 

HOLMES, C. C. and Held, L. (2006). Bayesian auxiliary variable models for binary and multinomial 
regression. Bayesian Anal. 1 145-168 (electronic). 

Jamshidian, M. (2004). On algorithms for restricted maximum likelihood estimation. Comput. Statist. 
Data Anal. 45 137-157. 

Jankowski, H. K. and Wellner, J. A. (2009). Computation of nonparametric convex hazard estima- 
tors via profile methods. J. Nonparametr. Stat., to appear 21 505-518. 

KOSOROK, M. R. (2008). Bootstrapping in Grenander estimator. In Beyond parametrics in interdisci- 
plinary research: Festschrift in honor of Professor Pranab K. Sen, vol. 1 of Inst. Math. Stat. Collect. 
Inst. Math. Statist., Beachwood, OH, 282-292. 

Matthews, G. and Crowther, N. (1998). Theory and methods a maximum likelihood estimation 
procedure for the generalized linear model with restrictions. South African Statist. J. 32 1 19-144. 



24 



Morton-Jones, T., Diggle, P., Parker, L., Dickinson, H. O. and Binks, K. (2000). Additive 
isotonic regression models in epidemiology. Stat. Med. 19 849-859. 

Mukerjee, H. and Tu, R. (1995). Order-restricted inferences in linear regression. /. Amer. Statist. 
Assoc. 90 717-728. 

Perlman, M. D. (1969). One-sided testing problems in multivariate analysis. Ann. Math. Statist. 40 
549-567. 

R DEVELOPMENT CORE TEAM (2009). R: A Language and Environment for Statistical Computing. R 
Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0. 

URL http : / /www . R-pro ject . org 

Robert, C. P. and Hwang, J. T. G. (1996). Maximum likelihood estimation under order restrictions 
by the prior feedback method. J. Amer. Statist. Assoc. 91 167-172. 

ROBERTSON, T., Wright, F. T. and DYKSTRA, R. L. (1988). Order restricted statistical inference. 
Wiley Series in Probability and Mathematical Statistics: Probability and Mathematical Statistics, John 
Wiley & Sons Ltd., Chichester. 

RUFIBACH, K. (2007). Computing maximum likelihood estimators of a log-concave density function. J. 
Statist. Comp. Sim. 77 561-574. 

RUFIBACH, K. (2009). OrdFacReg: Least squares, logistic, and Cox-regression with ordered predictors. 
R package version 1.0.1. 

URL http : / /www .biostat.uzh. ch / about us/people/ruf ibach . htrnT] 

Santner, T. J. and Duffy, D. E. (1986). A note on A. Albert and J. A. Anderson's conditions for the 
existence of maximum likelihood estimates in logistic regression models. Biometrika 73 755-758. 

Sen, B., Banerjee, M. and Woodroofe, M. B. (2009). Inconsistency of bootstrap: the grenander 
estimator. Ann. Statist., to appear. 

Shapiro, A. (1988). Towards a unified theory of inequality constrained testing in multivariate analysis. 
Internat. Statist. Rev. 56 49-62. 

SlLVAPULLE, M. J. (1994). On tests against one-sided hypotheses in some generalized linear models. 
Biometrics 50 853-858. 

SlLVAPULLE, M. J. and BURRIDGE, J. (1986). Existence of maximum likelihood estimates in regression 
models for grouped and ungrouped data. J. R. Stat. Soc. Ser. B Stat. Methodol. 48 100-106. 

Steyerberg, E. W. (2009). Clinical Prediction Models. Springer. 

Tan, M., Tian, G.-L., Fang, H.-B. and No, K. W. (2007). A fast EM algorithm for quadratic opti- 
mization subject to convex constraints. Statist. Sinica 17 945-964. 



25 



Taussky, D., Rufibach, K., Huguenin, P. and Allal, A. (2005). Risk factors for developing a 
second upper aerodigestive cancer after radiotherapy with or without chemotherapy in patients with 
head-and-neck cancers: an exploratory outcomes analysis. Int. J. Radiat. Oncol. Biol. Phys. 62 684- 
689. 

TAYLOR, J., WANG, L. and Li, Z. (2007). Analysis on binary responses with ordered covariates and 
missing data. Stat Med 26 3443-3458. 

Terlaky, T. and Vial, J.-R (1998). Computing maximum likelihood estimators of convex density 
functions. SI AM J. Sci. Comput. 19 675-694 (electronic). 

VAN DER KOOIJ, A. J., MEULMAN, J. J. and HEISER, W. J. (2006). Local minima in categorical 
multiple regression. Comput. Statist. Data Anal. 50 446-462. 

VAN ROSMALEN, J., KONING, A. and GROENEN, R (2009). Optimal scaling of interaction effects in 
generalized linear models. Multivariate Behavioral Research 44 59-81. 

Wang, W. and Peng, J. (2007). An algorithm to estimate monotone normal means and its application 
to identify the minimum effective dose. Tech. rep. Available at arxiv.org:0801.0079. 

WOLAK, F. A. (1987). An exact test for multiple inequality and equality constraints in the linear regres- 
sion model. /. Amer. Statist. Assoc. 82 782-793. 

Wright, R T. (1981). The asymptotic behavior of monotone regression estimates. Ann. Statist. 9 443- 
448. 



26 



